First Post

First Blog Post.

This first blog post will be a replication of the code developed by Vik Paruchuri in regards to understanding how race may effect SAT scores of students in New York.

This will be cleaned when I get a chance. For now I'm just going to show the python code.

In [1]:
import pandas as pd
import numpy as np

files = ["ap_2010.csv", "class_size.csv", "demographics.csv", "graduation.csv", "hs_directory.csv", "math_test_results.csv", "sat_results.csv"]

data = {}
for item in files:
    d = pd.read_csv("Documents/DS/schools2/{0}".format(item))
    data[item.replace(".csv","")] = d
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
 in ()
----> 1 import pandas as pd
      2 import numpy as np
      3 
      4 files = ["ap_2010.csv", "class_size.csv", "demographics.csv", "graduation.csv", "hs_directory.csv", "math_test_results.csv", "sat_results.csv"]
      5 

ImportError: No module named 'pandas'
In [2]:
for k,v in data.items():
    print('\n' + k + '\n')
    print(v.head())
graduation

    Demographic     DBN                            School Name    Cohort  \
0  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2003   
1  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2004   
2  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2005   
3  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2006   
4  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL  2006 Aug   

   Total Cohort Total Grads - n Total Grads - % of cohort Total Regents - n  \
0             5               s                         s                 s   
1            55              37                     67.3%                17   
2            64              43                     67.2%                27   
3            78              43                     55.1%                36   
4            78              44                     56.4%                37   

  Total Regents - % of cohort Total Regents - % of grads  \
0                           s                          s   
1                       30.9%                      45.9%   
2                       42.2%                      62.8%   
3                       46.2%                      83.7%   
4                       47.4%                      84.1%   

             ...            Regents w/o Advanced - n  \
0            ...                                   s   
1            ...                                  17   
2            ...                                  27   
3            ...                                  36   
4            ...                                  37   

  Regents w/o Advanced - % of cohort Regents w/o Advanced - % of grads  \
0                                  s                                 s   
1                              30.9%                             45.9%   
2                              42.2%                             62.8%   
3                              46.2%                             83.7%   
4                              47.4%                             84.1%   

  Local - n Local - % of cohort   Local - % of grads Still Enrolled - n  \
0         s                   s                    s                  s   
1        20               36.4%                54.1%                 15   
2        16                 25%  37.200000000000003%                  9   
3         7                  9%                16.3%                 16   
4         7                  9%                15.9%                 15   

  Still Enrolled - % of cohort Dropped Out - n Dropped Out - % of cohort  
0                            s               s                         s  
1                        27.3%               3                      5.5%  
2                        14.1%               9                     14.1%  
3                        20.5%              11                     14.1%  
4                        19.2%              11                     14.1%  

[5 rows x 23 columns]

ap_2010

      DBN                             SchoolName AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.              39   
1  01M450                 EAST SIDE COMMUNITY HS              19   
2  01M515                    LOWER EASTSIDE PREP              24   
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH             255   
4  02M296  High School of Hospitality Management               s   

  Total Exams Taken Number of Exams with scores 3 4 or 5  
0                49                                   10  
1                21                                    s  
2                26                                   24  
3               377                                  191  
4                 s                                    s  

sat_results

      DBN                                    SCHOOL NAME  \
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES   
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL   
2  01M450                     EAST SIDE COMMUNITY SCHOOL   
3  01M458                      FORSYTH SATELLITE ACADEMY   
4  01M509                        MARTA VALLE HIGH SCHOOL   

  Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score  \
0                     29                             355                 404   
1                     91                             383                 423   
2                     70                             377                 402   
3                      7                             414                 401   
4                     44                             390                 433   

  SAT Writing Avg. Score  
0                    363  
1                    366  
2                    370  
3                    359  
4                    384  

class_size

   CSD BOROUGH SCHOOL CODE                SCHOOL NAME GRADE  PROGRAM TYPE  \
0    1       M        M015  P.S. 015 Roberto Clemente     0K       GEN ED   
1    1       M        M015  P.S. 015 Roberto Clemente     0K          CTT   
2    1       M        M015  P.S. 015 Roberto Clemente     01       GEN ED   
3    1       M        M015  P.S. 015 Roberto Clemente     01          CTT   
4    1       M        M015  P.S. 015 Roberto Clemente     02       GEN ED   

  CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY)  \
0                                    -                                   -   
1                                    -                                   -   
2                                    -                                   -   
3                                    -                                   -   
4                                    -                                   -   

  SERVICE CATEGORY(K-9* ONLY)  NUMBER OF STUDENTS / SEATS FILLED  \
0                           -                               19.0   
1                           -                               21.0   
2                           -                               17.0   
3                           -                               17.0   
4                           -                               15.0   

   NUMBER OF SECTIONS  AVERAGE CLASS SIZE  SIZE OF SMALLEST CLASS  \
0                 1.0                19.0                    19.0   
1                 1.0                21.0                    21.0   
2                 1.0                17.0                    17.0   
3                 1.0                17.0                    17.0   
4                 1.0                15.0                    15.0   

   SIZE OF LARGEST CLASS DATA SOURCE  SCHOOLWIDE PUPIL-TEACHER RATIO  
0                   19.0         ATS                             NaN  
1                   21.0         ATS                             NaN  
2                   17.0         ATS                             NaN  
3                   17.0         ATS                             NaN  
4                   15.0         ATS                             NaN  

math_test_results

      DBN Grade  Year      Category  Number Tested Mean Scale Score Level 1 #  \
0  01M015     3  2006  All Students             39              667         2   
1  01M015     3  2007  All Students             31              672         2   
2  01M015     3  2008  All Students             37              668         0   
3  01M015     3  2009  All Students             33              668         0   
4  01M015     3  2010  All Students             26              677         6   

  Level 1 % Level 2 # Level 2 % Level 3 # Level 3 % Level 4 # Level 4 %  \
0      5.1%        11     28.2%        20     51.3%         6     15.4%   
1      6.5%         3      9.7%        22       71%         4     12.9%   
2        0%         6     16.2%        29     78.4%         2      5.4%   
3        0%         4     12.1%        28     84.8%         1        3%   
4     23.1%        12     46.2%         6     23.1%         2      7.7%   

  Level 3+4 # Level 3+4 %  
0          26       66.7%  
1          26       83.9%  
2          31       83.8%  
3          29       87.9%  
4           8       30.8%  

hs_directory

      dbn                                        school_name       boro  \
0  17K548                Brooklyn School for Music & Theatre   Brooklyn   
1  09X543                   High School for Violin and Dance      Bronx   
2  09X327        Comprehensive Model School Project M.S. 327      Bronx   
3  02M280     Manhattan Early College School for Advertising  Manhattan   
4  28Q680  Queens Gateway to Health Sciences Secondary Sc...     Queens   

  building_code    phone_number    fax_number grade_span_min  grade_span_max  \
0          K440    718-230-6250  718-230-6262              9              12   
1          X400    718-842-0687  718-589-9849              9              12   
2          X240    718-294-8111  718-294-8109              6              12   
3          M520  718-935-3477             NaN              9              10   
4          Q695    718-969-3155  718-969-3552              6              12   

  expgrade_span_min  expgrade_span_max  \
0               NaN                NaN   
1               NaN                NaN   
2               NaN                NaN   
3                 9               14.0   
4               NaN                NaN   

                         ...                          \
0                        ...                           
1                        ...                           
2                        ...                           
3                        ...                           
4                        ...                           

                                          priority02  \
0                    Then to New York City residents   
1  Then to New York City residents who attend an ...   
2  Then to Bronx students or residents who attend...   
3  Then to New York City residents who attend an ...   
4  Then to Districts 28 and 29 students or residents   

                                          priority03  \
0                                                NaN   
1                Then to Bronx students or residents   
2  Then to New York City residents who attend an ...   
3          Then to Manhattan students or residents     
4               Then to Queens students or residents   

                            priority04                       priority05  \
0                                  NaN                              NaN   
1      Then to New York City residents                              NaN   
2  Then to Bronx students or residents  Then to New York City residents   
3      Then to New York City residents                              NaN   
4      Then to New York City residents                              NaN   

  priority06  priority07 priority08  priority09 priority10  \
0        NaN         NaN        NaN         NaN        NaN   
1        NaN         NaN        NaN         NaN        NaN   
2        NaN         NaN        NaN         NaN        NaN   
3        NaN         NaN        NaN         NaN        NaN   
4        NaN         NaN        NaN         NaN        NaN   

                                          Location 1  
0  883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...  
1  1110 Boston Road\nBronx, NY 10456\n(40.8276026...  
2  1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...  
3  411 Pearl Street\nNew York, NY 10038\n(40.7106...  
4  160-20 Goethals Avenue\nJamaica, NY 11432\n(40...  

[5 rows x 58 columns]

demographics

      DBN                       Name  schoolyear fl_percent  frl_percent  \
0  01M015  P.S. 015 ROBERTO CLEMENTE    20052006       89.4          NaN   
1  01M015  P.S. 015 ROBERTO CLEMENTE    20062007       89.4          NaN   
2  01M015  P.S. 015 ROBERTO CLEMENTE    20072008       89.4          NaN   
3  01M015  P.S. 015 ROBERTO CLEMENTE    20082009       89.4          NaN   
4  01M015  P.S. 015 ROBERTO CLEMENTE    20092010                    96.5   

   total_enrollment prek   k grade1 grade2    ...     black_num black_per  \
0               281   15  36     40     33    ...            74      26.3   
1               243   15  29     39     38    ...            68      28.0   
2               261   18  43     39     36    ...            77      29.5   
3               252   17  37     44     32    ...            75      29.8   
4               208   16  40     28     32    ...            67      32.2   

  hispanic_num hispanic_per white_num white_per male_num male_per female_num  \
0          189         67.3         5       1.8    158.0     56.2      123.0   
1          153         63.0         4       1.6    140.0     57.6      103.0   
2          157         60.2         7       2.7    143.0     54.8      118.0   
3          149         59.1         7       2.8    149.0     59.1      103.0   
4          118         56.7         6       2.9    124.0     59.6       84.0   

  female_per  
0       43.8  
1       42.4  
2       45.2  
3       40.9  
4       40.4  

[5 rows x 38 columns]
In [3]:
data["class_size"]["DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis = 1)
#by setting axis = 1, we are applying our lambda function on each row (row-wise). So for every row, apply lambda. What is lambda?
#lambda takes the values in the CSD column and SCHOOL CODE column and appends them. This is done row by row, for all rows.
#The {0:02d}{1} is simple. It is equivalent to saying {0}{1}. The 02d just says, if it is a single digit value like "1" then put
# a "0" infront of the "1" (i.e make it 2digits). {0:03d} would mean 3 digits --> if x["CSD"] = 1, then 001
#{0:04d} would mean 4 digits --> if x["CSD"] = 1, then 0001; if x["CSD"] = 10, then 0010; if x["CSD"] = 100, then 0100, etc.
In [4]:
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
In [5]:
data["hs_directory"].head()
Out[5]:
dbn school_name boro building_code phone_number fax_number grade_span_min grade_span_max expgrade_span_min expgrade_span_max ... priority03 priority04 priority05 priority06 priority07 priority08 priority09 priority10 Location 1 DBN
0 17K548 Brooklyn School for Music & Theatre Brooklyn K440 718-230-6250 718-230-6262 9 12 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 883 Classon Avenue\nBrooklyn, NY 11225\n(40.67... 17K548
1 09X543 High School for Violin and Dance Bronx X400 718-842-0687 718-589-9849 9 12 NaN NaN ... Then to Bronx students or residents Then to New York City residents NaN NaN NaN NaN NaN NaN 1110 Boston Road\nBronx, NY 10456\n(40.8276026... 09X543
2 09X327 Comprehensive Model School Project M.S. 327 Bronx X240 718-294-8111 718-294-8109 6 12 NaN NaN ... Then to New York City residents who attend an ... Then to Bronx students or residents Then to New York City residents NaN NaN NaN NaN NaN 1501 Jerome Avenue\nBronx, NY 10452\n(40.84241... 09X327
3 02M280 Manhattan Early College School for Advertising Manhattan M520 718-935-3477 NaN 9 10 9 14.0 ... Then to Manhattan students or residents Then to New York City residents NaN NaN NaN NaN NaN NaN 411 Pearl Street\nNew York, NY 10038\n(40.7106... 02M280
4 28Q680 Queens Gateway to Health Sciences Secondary Sc... Queens Q695 718-969-3155 718-969-3552 6 12 NaN NaN ... Then to Queens students or residents Then to New York City residents NaN NaN NaN NaN NaN NaN 160-20 Goethals Avenue\nJamaica, NY 11432\n(40... 28Q680

5 rows × 59 columns

In [6]:
survey1 = pd.read_csv("Documents/DS/schools2/survey_75.txt", delimiter = '\t', encoding = "windows-1252")
survey2 = pd.read_csv("Documents/DS/schools2/survey_full.txt", delimiter = '\t', encoding = "windows-1252")
survey1['d75'] = True
survey2['d75'] = False
survey = pd.concat([survey1, survey2])
In [7]:
survey.head()
Out[7]:
N_p N_s N_t aca_p_11 aca_s_11 aca_t_11 aca_tot_11 bn com_p_11 com_s_11 ... t_q8c_1 t_q8c_2 t_q8c_3 t_q8c_4 t_q9 t_q9_1 t_q9_2 t_q9_3 t_q9_4 t_q9_5
0 244.0 8.0 81.0 8.5 6.3 6.4 7.1 K004 8.6 5.8 ... 38.0 55.0 6.0 1.0 5.4 13.0 31.0 27.0 22.0 8.0
1 115.0 97.0 43.0 8.8 7.7 7.2 7.9 K036 8.7 6.9 ... 38.0 50.0 8.0 5.0 1.6 60.0 30.0 5.0 3.0 3.0
2 189.0 131.0 75.0 8.1 8.5 8.0 8.2 K053 8.1 7.7 ... 46.0 49.0 3.0 1.0 4.5 25.0 28.0 21.0 18.0 7.0
3 150.0 70.0 52.0 8.9 8.8 6.3 8.0 K077 8.7 8.3 ... 28.0 50.0 16.0 6.0 5.6 24.0 14.0 26.0 32.0 4.0
4 117.0 69.0 45.0 8.3 7.3 6.0 7.2 K140 8.2 6.1 ... 30.0 45.0 15.0 10.0 2.3 55.0 21.0 12.0 7.0 5.0

5 rows × 2773 columns

In [8]:
survey["DBN"] = survey["dbn"]
survey_heads = ["DBN", "rr_s", "rr_t", "rr_p", "N_p", "N_s", "N_t", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_10", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]
survey = survey.loc[:,survey_heads]
survey.head()
data["survey"] = survey
In [9]:
data["survey"].shape
Out[9]:
(1702, 23)
In [10]:
class_size = data['class_size']
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
In [11]:
class_size.head()
Out[11]:
DBN CSD NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS SCHOOLWIDE PUPIL-TEACHER RATIO
0 01M292 1 88.0000 4.000000 22.564286 18.50 26.571429 NaN
1 01M332 1 46.0000 2.000000 22.000000 21.00 23.500000 NaN
2 01M378 1 33.0000 1.000000 33.000000 33.00 33.000000 NaN
3 01M448 1 105.6875 4.750000 22.231250 18.25 27.062500 NaN
4 01M450 1 57.6000 2.733333 21.200000 19.40 22.866667 NaN
In [12]:
demographics = data["demographics"]
demographics = demographics[demographics["schoolyear"] == 20112012]
data["demographics"] = demographics
In [13]:
math = data["math_test_results"]
math = math[(math["Grade"] == "8")]
math = math[(math["Year"] == 2011)]
data["math_test_results"] = math
In [14]:
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
data["graduation"] = data['graduation'][data["graduation"]["Cohort"] == "2006"]
In [15]:
data["sat_results"].head()
Out[15]:
DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score
0 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 29 355 404 363
1 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 91 383 423 366
2 01M450 EAST SIDE COMMUNITY SCHOOL 70 377 402 370
3 01M458 FORSYTH SATELLITE ACADEMY 7 414 401 359
4 01M509 MARTA VALLE HIGH SCHOOL 44 390 433 384
In [16]:
sat_cols = ["SAT Critical Reading Avg. Score", "SAT Math Avg. Score", "SAT Writing Avg. Score"]
for val in sat_cols:
    data["sat_results"][val] = data["sat_results"][val].convert_objects(convert_numeric=True)

data["sat_results"]["sat_total"] = data["sat_results"][sat_cols[0]] + data["sat_results"][sat_cols[1]] + data["sat_results"][sat_cols[2]]
C:\Users\Haris Hyder\Anaconda3\lib\site-packages\ipykernel\__main__.py:3: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  app.launch_new_instance()
In [17]:
lat = data["hs_directory"]["Location 1"].apply(lambda x: x.split("\n")[-1].replace("(","").replace(")","").split(",")[0])
lon = data["hs_directory"]["Location 1"].apply(lambda x: x.split("\n")[-1].replace("(","").replace(")","").split(",")[1])
# Now we are applying column wise (if we dont specify the axis it is assumed axis = 0 which means columnwise). For every element
# under the column Location 1 parse the value every time we have "\n" and put all the seperate parts in a list, then
# take the last element of this list which is (lat, lon) and then further clean.
In [18]:
lat.dtypes
Out[18]:
dtype('O')
In [19]:
data["hs_directory"]["lat"] = lat.convert_objects(convert_numeric = True)
data["hs_directory"]["lon"] = lon.convert_objects(convert_numeric = True)
C:\Users\Haris Hyder\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  if __name__ == '__main__':
C:\Users\Haris Hyder\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  from ipykernel import kernelapp as app
In [20]:
for k,v in data.items():
    print("\n" + k + "\n")
    print(v.head())
    print(v.shape)
graduation

     Demographic     DBN                            School Name Cohort  \
3   Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL   2006   
10  Total Cohort  01M448    UNIVERSITY NEIGHBORHOOD HIGH SCHOOL   2006   
17  Total Cohort  01M450             EAST SIDE COMMUNITY SCHOOL   2006   
24  Total Cohort  01M509                MARTA VALLE HIGH SCHOOL   2006   
31  Total Cohort  01M515  LOWER EAST SIDE PREPARATORY HIGH SCHO   2006   

    Total Cohort Total Grads - n Total Grads - % of cohort Total Regents - n  \
3             78              43                     55.1%                36   
10           124              53                     42.7%                42   
17            90              70                     77.8%                67   
24            84              47                       56%                40   
31           193             105                     54.4%                91   

   Total Regents - % of cohort Total Regents - % of grads  \
3                        46.2%                      83.7%   
10                       33.9%                      79.2%   
17         74.400000000000006%                      95.7%   
24                       47.6%                      85.1%   
31                       47.2%                      86.7%   

              ...            Regents w/o Advanced - n  \
3             ...                                  36   
10            ...                                  34   
17            ...                                  67   
24            ...                                  23   
31            ...                                  22   

   Regents w/o Advanced - % of cohort Regents w/o Advanced - % of grads  \
3                               46.2%                             83.7%   
10                              27.4%                             64.2%   
17                74.400000000000006%                             95.7%   
24                              27.4%                             48.9%   
31                              11.4%                               21%   

   Local - n Local - % of cohort Local - % of grads Still Enrolled - n  \
3          7                  9%              16.3%                 16   
10        11                8.9%              20.8%                 46   
17         3                3.3%               4.3%                 15   
24         7  8.300000000000001%              14.9%                 25   
31        14                7.3%              13.3%                 53   

   Still Enrolled - % of cohort Dropped Out - n Dropped Out - % of cohort  
3                         20.5%              11                     14.1%  
10                        37.1%              20       16.100000000000001%  
17                        16.7%               5                      5.6%  
24                        29.8%               5                        6%  
31                        27.5%              35       18.100000000000001%  

[5 rows x 23 columns]
(405, 23)

ap_2010

      DBN                             SchoolName AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.              39   
1  01M450                 EAST SIDE COMMUNITY HS              19   
2  01M515                    LOWER EASTSIDE PREP              24   
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH             255   
4  02M296  High School of Hospitality Management               s   

  Total Exams Taken Number of Exams with scores 3 4 or 5  
0                49                                   10  
1                21                                    s  
2                26                                   24  
3               377                                  191  
4                 s                                    s  
(258, 5)

sat_results

      DBN                                    SCHOOL NAME  \
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES   
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL   
2  01M450                     EAST SIDE COMMUNITY SCHOOL   
3  01M458                      FORSYTH SATELLITE ACADEMY   
4  01M509                        MARTA VALLE HIGH SCHOOL   

  Num of SAT Test Takers  SAT Critical Reading Avg. Score  \
0                     29                            355.0   
1                     91                            383.0   
2                     70                            377.0   
3                      7                            414.0   
4                     44                            390.0   

   SAT Math Avg. Score  SAT Writing Avg. Score  sat_total  
0                404.0                   363.0     1122.0  
1                423.0                   366.0     1172.0  
2                402.0                   370.0     1149.0  
3                401.0                   359.0     1174.0  
4                433.0                   384.0     1207.0  
(478, 7)

class_size

      DBN  CSD  NUMBER OF STUDENTS / SEATS FILLED  NUMBER OF SECTIONS  \
0  01M292    1                            88.0000            4.000000   
1  01M332    1                            46.0000            2.000000   
2  01M378    1                            33.0000            1.000000   
3  01M448    1                           105.6875            4.750000   
4  01M450    1                            57.6000            2.733333   

   AVERAGE CLASS SIZE  SIZE OF SMALLEST CLASS  SIZE OF LARGEST CLASS  \
0           22.564286                   18.50              26.571429   
1           22.000000                   21.00              23.500000   
2           33.000000                   33.00              33.000000   
3           22.231250                   18.25              27.062500   
4           21.200000                   19.40              22.866667   

   SCHOOLWIDE PUPIL-TEACHER RATIO  
0                             NaN  
1                             NaN  
2                             NaN  
3                             NaN  
4                             NaN  
(583, 8)

math_test_results

        DBN Grade  Year      Category  Number Tested Mean Scale Score  \
111  01M034     8  2011  All Students             48              646   
280  01M140     8  2011  All Students             61              665   
346  01M184     8  2011  All Students             49              727   
388  01M188     8  2011  All Students             49              658   
411  01M292     8  2011  All Students             49              650   

    Level 1 # Level 1 % Level 2 # Level 2 % Level 3 # Level 3 % Level 4 #  \
111        15     31.3%        22     45.8%        11     22.9%         0   
280         1      1.6%        43     70.5%        17     27.9%         0   
346         0        0%         0        0%         5     10.2%        44   
388        10     20.4%        26     53.1%        10     20.4%         3   
411        15     30.6%        25       51%         7     14.3%         2   

    Level 4 % Level 3+4 # Level 3+4 %  
111        0%          11       22.9%  
280        0%          17       27.9%  
346     89.8%          49        100%  
388      6.1%          13       26.5%  
411      4.1%           9       18.4%  
(499, 16)

hs_directory

      dbn                                        school_name       boro  \
0  17K548                Brooklyn School for Music & Theatre   Brooklyn   
1  09X543                   High School for Violin and Dance      Bronx   
2  09X327        Comprehensive Model School Project M.S. 327      Bronx   
3  02M280     Manhattan Early College School for Advertising  Manhattan   
4  28Q680  Queens Gateway to Health Sciences Secondary Sc...     Queens   

  building_code    phone_number    fax_number grade_span_min  grade_span_max  \
0          K440    718-230-6250  718-230-6262              9              12   
1          X400    718-842-0687  718-589-9849              9              12   
2          X240    718-294-8111  718-294-8109              6              12   
3          M520  718-935-3477             NaN              9              10   
4          Q695    718-969-3155  718-969-3552              6              12   

  expgrade_span_min  expgrade_span_max    ...      \
0               NaN                NaN    ...       
1               NaN                NaN    ...       
2               NaN                NaN    ...       
3                 9               14.0    ...       
4               NaN                NaN    ...       

                        priority05 priority06 priority07 priority08  \
0                              NaN        NaN        NaN        NaN   
1                              NaN        NaN        NaN        NaN   
2  Then to New York City residents        NaN        NaN        NaN   
3                              NaN        NaN        NaN        NaN   
4                              NaN        NaN        NaN        NaN   

  priority09  priority10                                         Location 1  \
0        NaN         NaN  883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...   
1        NaN         NaN  1110 Boston Road\nBronx, NY 10456\n(40.8276026...   
2        NaN         NaN  1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...   
3        NaN         NaN  411 Pearl Street\nNew York, NY 10038\n(40.7106...   
4        NaN         NaN  160-20 Goethals Avenue\nJamaica, NY 11432\n(40...   

      DBN        lat        lon  
0  17K548  40.670299 -73.961648  
1  09X543  40.827603 -73.904475  
2  09X327  40.842414 -73.916162  
3  02M280  40.710679 -74.000807  
4  28Q680  40.718810 -73.806500  

[5 rows x 61 columns]
(435, 61)

demographics

       DBN                                              Name  schoolyear  \
6   01M015  P.S. 015 ROBERTO CLEMENTE                           20112012   
13  01M019  P.S. 019 ASHER LEVY                                 20112012   
20  01M020  PS 020 ANNA SILVER                                  20112012   
27  01M034  PS 034 FRANKLIN D ROOSEVELT                         20112012   
35  01M063  PS 063 WILLIAM MCKINLEY                             20112012   

   fl_percent  frl_percent  total_enrollment prek    k grade1 grade2  \
6         NaN         89.4               189   13   31     35     28   
13        NaN         61.5               328   32   46     52     54   
20        NaN         92.5               626   52  102    121     87   
27        NaN         99.7               401   14   34     38     36   
35        NaN         78.9               176   18   20     30     21   

      ...     black_num black_per hispanic_num hispanic_per white_num  \
6     ...            63      33.3          109         57.7         4   
13    ...            81      24.7          158         48.2        28   
20    ...            55       8.8          357         57.0        16   
27    ...            90      22.4          275         68.6         8   
35    ...            41      23.3          110         62.5        15   

   white_per male_num male_per female_num female_per  
6        2.1     97.0     51.3       92.0       48.7  
13       8.5    147.0     44.8      181.0       55.2  
20       2.6    330.0     52.7      296.0       47.3  
27       2.0    204.0     50.9      197.0       49.1  
35       8.5     97.0     55.1       79.0       44.9  

[5 rows x 38 columns]
(1509, 38)

survey

      DBN  rr_s  rr_t  rr_p    N_p    N_s   N_t  saf_p_11  com_p_11  eng_p_11  \
0  75K004  38.0    90    72  244.0    8.0  81.0       9.1       8.6       8.4   
1  75K036  70.0    69    44  115.0   97.0  43.0       9.2       8.7       8.6   
2  75K053  94.0    97    53  189.0  131.0  75.0       8.8       8.1       8.0   
3  75K077  95.0    65    55  150.0   70.0  52.0       9.4       8.7       8.8   
4  75K140  77.0    70    42  117.0   69.0  45.0       8.6       8.2       7.9   

      ...      eng_t_10  aca_t_11  saf_s_11  com_s_11  eng_s_11  aca_s_11  \
0     ...           NaN       6.4       7.6       5.8       7.6       6.3   
1     ...           NaN       7.2       6.6       6.9       7.3       7.7   
2     ...           NaN       8.0       8.2       7.7       8.6       8.5   
3     ...           NaN       6.3       8.6       8.3       8.6       8.8   
4     ...           NaN       6.0       4.7       6.1       6.0       7.3   

   saf_tot_11  com_tot_11  eng_tot_11  aca_tot_11  
0         8.0         7.0         7.5         7.1  
1         7.5         7.6         7.3         7.9  
2         8.2         7.8         7.8         8.2  
3         8.3         7.7         7.9         8.0  
4         6.3         6.7         6.3         7.2  

[5 rows x 23 columns]
(1702, 23)
In [21]:
flat_data_names = [k for k,v in data.items()]
flat_data_names[3] = "d"
flat_data_names[0] = "math_test_results"
flat_data_names[3] = "graduation"
flat_data_names
Out[21]:
['math_test_results',
 'ap_2010',
 'sat_results',
 'graduation',
 'math_test_results',
 'hs_directory',
 'demographics',
 'survey']
In [22]:
flat_data = [data[k] for k in flat_data_names]
full = flat_data[0]
for i, f in enumerate(flat_data[1:]):
    name = flat_data_names[i+1]
    print(name)
    print(len(f["DBN"]) - len(f["DBN"].unique()))
    join_type = "inner"
    if name in ["sat_results", "graduation", "ap_2010"]:
        join_type = "outer"
    if name not in ["math_test_results"]:
        full = full.merge(f, on="DBN", how=join_type)

full.shape
    
ap_2010
1
sat_results
0
graduation
0
math_test_results
0
hs_directory
0
demographics
0
survey
0
Out[22]:
(377, 167)
In [23]:
ap_cols = ["AP Test Takers ", "Total Exams Taken", "Number of Exams with scores 3 4 or 5"]

for col in ap_cols:
    full[col] = full[col].convert_objects(convert_numeric = True)    
full[ap_cols] = full[ap_cols].fillna(value=0)
C:\Users\Haris Hyder\Anaconda3\lib\site-packages\ipykernel\__main__.py:4: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
In [24]:
full["school_dist"] = full["DBN"].apply(lambda x: x[:2])
In [25]:
full = full.fillna(full.mean())
In [26]:
full.corr()["sat_total"]
Out[26]:
Year                                             NaN
Number Tested                           8.480643e-02
AP Test Takers                          5.623083e-01
Total Exams Taken                       5.523038e-01
Number of Exams with scores 3 4 or 5    5.561146e-01
SAT Critical Reading Avg. Score         9.867981e-01
SAT Math Avg. Score                     9.727446e-01
SAT Writing Avg. Score                  9.879493e-01
sat_total                               1.000000e+00
Total Cohort                            3.204298e-01
grade_span_max                          7.543147e-18
expgrade_span_max                                NaN
zip                                    -7.440723e-02
total_students                          4.037090e-01
number_programs                         1.131847e-01
lat                                    -1.195460e-01
lon                                    -1.387982e-01
schoolyear                                       NaN
frl_percent                            -6.963302e-01
total_enrollment                        3.633211e-01
ell_num                                -1.544852e-01
ell_percent                            -3.971109e-01
sped_num                                3.402716e-02
sped_percent                           -4.389156e-01
asian_num                               4.705871e-01
asian_per                               5.662172e-01
black_num                               2.282950e-02
black_per                              -2.879651e-01
hispanic_num                            2.327924e-02
hispanic_per                           -3.924851e-01
white_num                               4.503638e-01
white_per                               6.072655e-01
male_num                                3.218594e-01
male_per                               -1.045303e-01
female_num                              3.834069e-01
female_per                              1.045737e-01
rr_s                                    2.338289e-01
rr_t                                   -3.283554e-02
rr_p                                    5.015197e-02
N_p                                     4.180960e-01
N_s                                     4.189883e-01
N_t                                     2.857141e-01
saf_p_11                                1.209745e-01
com_p_11                               -1.120529e-01
eng_p_11                                2.697281e-02
aca_p_11                                3.702127e-02
saf_t_11                                3.112331e-01
com_t_11                                8.499880e-02
eng_t_10                                         NaN
aca_t_11                                1.383281e-01
saf_s_11                                3.330764e-01
com_s_11                                1.934145e-01
eng_s_11                                2.097384e-01
aca_s_11                                3.296402e-01
saf_tot_11                              3.154647e-01
com_tot_11                              8.540874e-02
eng_tot_11                              1.081932e-01
aca_tot_11                              1.932637e-01
Name: sat_total, dtype: float64
In [36]:
import folium
from folium import plugins

base_map = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start = 10)
marker_cluster = folium.MarkerCluster().add_to(base_map)

for name, row in full.iterrows():
    folium.Marker([row["lat"], row["lon"]], popup="{0}: {1}".format(row["DBN"], row["school_name"])).add_to(marker_cluster)
 
base_map.create_map('schools.html')
base_map
C:\Users\Haris Hyder\Anaconda3\lib\site-packages\ipykernel\__main__.py:10: FutureWarning: Map.create_map is deprecated. Use Map.save instead
Out[36]:
In [38]:
schools_heatmap = folium.Map(location=[full["lat"].mean(), full["lon"].mean()], zoom_start=10)
schools_heatmap.add_children(plugins.HeatMap([[row['lat'], row['lon']] for name, row in full.iterrows()]))
schools_heatmap.save("heatmap.html")
schools_heatmap
Out[38]:
In [42]:
district_group = full.groupby('school_dist').agg(np.mean)
district_group.reset_index(inplace=True)
district_group['school_dist'] = district_group['school_dist'].apply(lambda x: str(int(x)))
In [43]:
district_group.head()
Out[43]:
school_dist Year Number Tested AP Test Takers Total Exams Taken Number of Exams with scores 3 4 or 5 SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_total ... eng_t_10 aca_t_11 saf_s_11 com_s_11 eng_s_11 aca_s_11 saf_tot_11 com_tot_11 eng_tot_11 aca_tot_11
0 1 2011.0 78.319767 52.166667 74.500000 33.500000 441.833333 473.333333 439.333333 1354.500000 ... NaN 7.500000 6.771524 6.169608 6.740731 7.450357 7.433333 6.816667 7.116667 7.683333
1 2 2011.0 73.557228 68.411765 108.745098 76.019608 429.285326 445.880097 426.974319 1302.139742 ... NaN 7.588235 6.935866 6.251326 6.693027 7.417689 7.529412 6.733333 7.070588 7.568627
2 3 2011.0 64.491950 74.692308 119.615385 72.000000 426.771500 436.643295 424.909462 1288.324257 ... NaN 7.000000 6.861538 6.346154 6.738462 7.446154 7.323077 6.646154 6.946154 7.446154
3 4 2011.0 68.899709 48.375000 62.375000 36.250000 402.438422 416.730088 404.994469 1224.162979 ... NaN 8.062500 7.000000 6.312500 6.875000 7.675000 7.787500 7.100000 7.325000 7.887500
4 5 2011.0 88.416944 48.857143 59.428571 32.857143 427.215339 438.262958 419.707965 1285.186262 ... NaN 6.957143 6.314286 6.000000 6.428571 7.285714 7.085714 6.414286 6.671429 7.314286

5 rows × 59 columns

In [58]:
def build_district_map(col):
    geo_path = "Documents/DS/schools2/School Districts.geojson"
    districts = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
    districts.geo_json(
        geo_path=geo_path,
        data=district_group,
        columns=["school_dist", col],
        key_on='feature.properties.school_dist',
        fill_color='YlGn',
        line_opacity=0.2,
        fill_opacity=0.7
    )
    districts.save("district.html")
    return districts

build_district_map('sat_total')   
C:\Users\Haris Hyder\Anaconda3\lib\site-packages\folium\folium.py:504: UserWarning: This method is deprecated. Please use Map.choropleth instead.
  warnings.warn('This method is deprecated. '
C:\Users\Haris Hyder\Anaconda3\lib\site-packages\folium\folium.py:506: FutureWarning: 'threshold_scale' default behavior has changed. Now you get a linear scale between the 'min' and the 'max' of your data. To get former behavior, use folium.utilities.split_six.
  return self.choropleth(*args, **kwargs)
Out[58]:
In [48]:
%matplotlib inline

full.plot.scatter(x="total_enrollment", y="sat_total")
Out[48]:
In [54]:
full[(full['sat_total'] < 1000) & (full["total_enrollment"] < 1000)]["School Name"]
Out[54]:
38     INTERNATIONAL SCHOOL FOR LIBERAL ARTS
148                                      NaN
153    KINGSBRIDGE INTERNATIONAL HIGH SCHOOL
208                MULTICULTURAL HIGH SCHOOL
299      INTERNATIONAL COMMUNITY HIGH SCHOOL
310          BRONX INTERNATIONAL HIGH SCHOOL
320                                      NaN
323            HIGH SCHOOL OF WORLD CULTURES
326       BROOKLYN INTERNATIONAL HIGH SCHOOL
336    INTERNATIONAL HIGH SCHOOL AT PROSPECT
338               IT TAKES A VILLAGE ACADEMY
358    PAN AMERICAN INTERNATIONAL HIGH SCHOO
Name: School Name, dtype: object
In [55]:
full.plot.scatter(x='ell_percent', y='sat_total')
Out[55]:
In [59]:
build_district_map('ell_percent')
C:\Users\Haris Hyder\Anaconda3\lib\site-packages\folium\folium.py:504: UserWarning: This method is deprecated. Please use Map.choropleth instead.
  warnings.warn('This method is deprecated. '
C:\Users\Haris Hyder\Anaconda3\lib\site-packages\folium\folium.py:506: FutureWarning: 'threshold_scale' default behavior has changed. Now you get a linear scale between the 'min' and the 'max' of your data. To get former behavior, use folium.utilities.split_six.
  return self.choropleth(*args, **kwargs)
Out[59]:
In [66]:
full.corr()["sat_total"][("rr_s, rr_t, rr_p, N_p, N_s, N_t, saf_tot_11, com_tot_11, eng_tot_11, aca_tot_11").split(", ")].plot.bar()
Out[66]:
In [73]:
full.corr()["sat_total"][["asian_per", "black_per", "hispanic_per", "white_per"]].plot.bar()
Out[73]:
In [74]:
full.corr()["ell_percent"]["hispanic_per"]
Out[74]:
0.40271378098065158
In [75]:
full.plot.scatter(x="ell_percent", y="hispanic_per")
Out[75]:
In [76]:
build_district_map("hispanic_per")
C:\Users\Haris Hyder\Anaconda3\lib\site-packages\folium\folium.py:504: UserWarning: This method is deprecated. Please use Map.choropleth instead.
  warnings.warn('This method is deprecated. '
C:\Users\Haris Hyder\Anaconda3\lib\site-packages\folium\folium.py:506: FutureWarning: 'threshold_scale' default behavior has changed. Now you get a linear scale between the 'min' and the 'max' of your data. To get former behavior, use folium.utilities.split_six.
  return self.choropleth(*args, **kwargs)
Out[76]:
In [78]:
full.corr()["sat_total"][["male_per", "female_per"]].plot.bar()
Out[78]:
In [79]:
full.plot.scatter(x="female_per", y="sat_total")
Out[79]:
In [82]:
full[(full["sat_total"] > 1400) & (full["female_per"] > 65)]["School Name"]
Out[82]:
5             PROFESSIONAL PERFORMING ARTS HIGH SCH
97                    ELEANOR ROOSEVELT HIGH SCHOOL
105                    TALENT UNLIMITED HIGH SCHOOL
116            FIORELLO H. LAGUARDIA HIGH SCHOOL OF
234                     TOWNSEND HARRIS HIGH SCHOOL
255    FRANK SINATRA SCHOOL OF THE ARTS HIGH SCHOOL
270                  BARD HIGH SCHOOL EARLY COLLEGE
Name: School Name, dtype: object
In [83]:
full["avg_ap"] = full["AP Test Takers "] / full["total_enrollment"]
In [84]:
full.corr()["sat_total"]["avg_ap"]
Out[84]:
0.61225099329866561
In [85]:
full.plot.scatter(x="avg_ap", y='sat_total')
Out[85]:
In [87]:
full[(full['avg_ap'] > 0.3) & (full['sat_total'] > 1700)]["School Name"]
Out[87]:
97             ELEANOR ROOSEVELT HIGH SCHOOL
103                   STUYVESANT HIGH SCHOOL
162             BRONX HIGH SCHOOL OF SCIENCE
166    HIGH SCHOOL OF AMERICAN STUDIES AT LE
181           BROOKLYN TECHNICAL HIGH SCHOOL
234              TOWNSEND HARRIS HIGH SCHOOL
248    QUEENS HIGH SCHOOL FOR THE SCIENCES A
265      STATEN ISLAND TECHNICAL HIGH SCHOOL
Name: School Name, dtype: object
In [ ]:
 

blogroll

social